Session 3: Statistical modeling and machine learning
Department of Econometrics and Business Statistics
Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.
In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law the restaurant offered seating in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule.
What is \(y\)? What is \(x\)?
Each person in a group monitored their email for a week and recorded information about each email message; for example, whether it was spam, and what day of the week and time of day the email arrived. We want to use this information to build a spam filter, a classifier that will catch spam with high probability but will never classify good email as spam.
What is \(y\)? What is \(x\)?
A health insurance company collected the following information about households:
The health insurance company wants to provide a small range of products, containing different bundles of services and for different levels of cover, to market to customers.
What is \(y\)? What is \(x\)?
All (data-centric) models have fitted values and residuals.
\[y = f(x_1, x_2, ..., x_p) + \varepsilon\]
where \(y\) is the observed response, \(x_1, ..., x_p\) are the observed values of the \(p\) predictors, and \(\varepsilon\) is the error. We conventionally use \(n\) to specify the sample size.
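As a concrete illustration of fitted values and residuals, here is a minimal sketch using base R's built-in `cars` data (the variables `dist` and `speed` come from that dataset, not from the notes):

```r
# Fit a simple linear model: stopping distance as a function of speed
fit <- lm(dist ~ speed, data = cars)

# Fitted values (y-hat) and residuals (e = y - y-hat)
y_hat <- fitted(fit)
e     <- residuals(fit)

# The observed response decomposes exactly as y = y_hat + e
isTRUE(all.equal(cars$dist, unname(y_hat + e)))

# For least squares with an intercept, the residuals sum to (numerically) zero
sum(e)
```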
Predictive accuracy
The primary purpose is to be able to predict \(\widehat{Y}\) for new data. And we’d like to do that well! That is, accurately.
Interpretability
Almost equally important is that we want to understand the relationship between \({\mathbf X}\) and \(Y\). The simpler model that is (almost) as accurate is the one we choose, always.
Person: Why did you predict 42 for this value?
Computer: Awkward silence
Parametric methods
Non-parametric methods
Black line is true boundary.
Grids (right) show boundaries for two different models.
If the model form is incorrect, some of the error (solid circles) arises from the wrong shape of the boundary, and is thus reducible. Irreducible error means that even with the right model, the remaining mistakes (solid circles) are random noise.
Parametric models tend to be less flexible but non-parametric models can be flexible or less flexible depending on parameter settings.
Bias is the error introduced by approximating a complicated problem with a simpler model.
Variance refers to how much your estimate would change if you had different training data. It measures how much your model depends on the particular data you have, to the neglect of future data.
When you impose too many assumptions with a parametric model, or use an inadequate non-parametric model (for example, by not letting an algorithm converge fully).
When the model closely captures the true shape, with a parametric model or a flexible model.
This fit would be virtually identical even if we used a different training sample.
Likely to get a very different model if a different training set is used.
"When data are reused for multiple tasks, instead of carefully spent from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors."

Julia Silge
[1] "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[1] "B" "B" "A" "B" "B" "A" "B" "B"
[1] "B" "B" "B" "B"
Compute \(\widehat{y}\) from the training data, \(\{(y_i, {\mathbf x}_i)\}_{i = 1}^n\). The fraction of misclassifications on these same observations gives the training error rate:
\[\text{Error rate} = \frac{1}{n}\sum_{i=1}^n I(y_i \ne \widehat{y}({\mathbf x}_i))\]
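In R, the error rate is just the proportion of misclassified observations; a minimal sketch (the labels below are made up for illustration):

```r
# Observed classes and model predictions (hypothetical labels)
y     <- c("A", "A", "B", "B", "B", "A", "B", "B")
y_hat <- c("A", "B", "B", "B", "A", "A", "B", "B")

# Error rate: (1/n) * sum of I(y_i != y_hat_i)
error_rate <- mean(y != y_hat)
error_rate  # 2 misclassifications out of 8, i.e. 0.25
```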
A better estimate of future accuracy is obtained using test data to get the Test Error Rate.
Training error will usually be smaller than test error. When it is much smaller, this indicates that the model fits the training data too closely to be accurate on future data (it is over-fitted).
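A quick sketch of over-fitting with base R (simulated data, so every name and setting here is illustrative): a very flexible fit drives the training error well below the test error.

```r
set.seed(1)
n <- 30
x <- runif(2 * n)
y <- sin(2 * pi * x) + rnorm(2 * n, sd = 0.3)
train <- 1:n
test  <- (n + 1):(2 * n)

# A very flexible fit: degree-15 polynomial on the training half
fit <- lm(y ~ poly(x, 15),
          data = data.frame(x = x[train], y = y[train]))

mse <- function(idx)
  mean((y[idx] - predict(fit, newdata = data.frame(x = x[idx])))^2)

mse(train)  # small: the model chases the training noise
mse(test)   # larger: the over-fit does not generalize
```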
|          | predicted 1 | predicted 0 |
|----------|-------------|-------------|
| true 1   | a           | b           |
| true 0   | c           | d           |
Consider 1=positive (P), 0=negative (N).
- true positives: a
- true negatives: d
- false positives: c (Type I error)
- false negatives: b (Type II error)
- sensitivity: a/(a+b)
- specificity: d/(c+d)
- prevalence: (a+b)/(a+b+c+d)
- accuracy: (a+d)/(a+b+c+d)

Two classes
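A sketch of these measures computed from confusion-matrix counts (the counts are hypothetical, chosen only to give round numbers; the cell letters match the table above):

```r
# Hypothetical confusion-matrix counts:
# a = true positives, b = false negatives,
# c = false positives, d = true negatives
a <- 40; b <- 10; c <- 5; d <- 45

sensitivity <- a / (a + b)                 # 40/50 = 0.8
specificity <- d / (c + d)                 # 45/50 = 0.9
accuracy    <- (a + d) / (a + b + c + d)   # 85/100 = 0.85
```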
# A tibble: 2 × 4
# Groups: y [2]
y bilby quokka cl_acc
<fct> <int> <int> <dbl>
1 bilby 9 3 0.75
2 quokka 5 10 0.667
More than two classes
# A tibble: 3 × 5
# Groups: y [3]
y bilby quokka numbat cl_err
<fct> <int> <int> <int> <dbl>
1 bilby 9 3 0 0.75
2 numbat 0 2 6 0.75
3 quokka 5 10 0 0.667
The balance of getting it right, without predicting everything as positive.
We need predictive probabilities: the probability of belonging to each class.
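One standard way to obtain predictive probabilities is logistic regression; a sketch on R's built-in `mtcars` data (the choice of response and predictor is only for illustration, not from the notes):

```r
# Probability of a manual transmission (am = 1) as a function of weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predictive probability of class 1 for each observation
p <- predict(fit, type = "response")

# Classify with a 0.5 threshold, then cross-tabulate
y_hat <- as.integer(p > 0.5)
table(observed = mtcars$am, predicted = y_hat)
```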
The first thing to do with data is to look at them …. usually means tabulating and plotting the data in many different ways to see what’s going on. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later.
Crowder, M. J. & Hand, D. J. (1990) “Analysis of Repeated Measures”
Creating new variables to get better fits is a special skill! Sometimes automated by the method. All are transformations of the original variables. (See tidymodels steps.)
- dimension reduction (step_pca)
- log transform (step_log)
- ratios of variables (step_ratio)
- spline basis (step_ns)
- dummy/indicator variables (step_dummy)

Compute and examine the usual diagnostics; some methods have more.
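These transformations can also be done by hand; a base-R sketch using `mtcars` (the variable choices are illustrative, with the analogous tidymodels step named in each comment):

```r
d <- mtcars
d$log_disp  <- log(d$disp)    # log transform (cf. step_log)
d$hp_per_wt <- d$hp / d$wt    # ratio of two predictors (cf. step_ratio)
d$cyl_f     <- factor(d$cyl)  # categorical, ready for dummy coding

# Dummy variables: intercept plus one indicator per non-baseline level
# (cf. step_dummy)
head(model.matrix(~ cyl_f, data = d))
```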
Go beyond … Look at the data and the model together!
Training - plusses; Test - dots
We plot the model on the data to assess whether it fits or is a misfit!
Doing this in high dimensions is considered difficult!
So it is common to only plot the data-in-the-model-space.
NOTE, WE CAN PLOT THESE THINGS IN HIGH DIMENSIONS. But this is for another day.
Predictive probabilities are part of the model, and are useful to plot. What do we learn here?
But it doesn’t tell you why there is a difference.
The model is displayed as a grid of predicted points in the original variable space, with the data overlaid using text labels. What do you learn?
One model has a linear boundary, and the other has a highly non-linear boundary, which matches the class clusters better. Also …
Data looks uninteresting: weak linear relationships between response and predictors.
library(ggplot2)
library(nullabor)   # provides lineup() and null_lm()

# tips data (available, e.g., in the reshape2 package)
x <- lm(tip ~ total_bill, data = tips)
tips.reg <- data.frame(tips,
                       .resid = residuals(x),
                       .fitted = fitted(x))

ggplot(lineup(null_lm(tip ~ total_bill,
                      method = 'rotate'),
              tips.reg)) +
  geom_point(aes(x = total_bill,
                 y = .resid)) +
  facet_wrap(~ .sample) +
  theme(axis.text = element_blank(),
        axis.title = element_blank())

The package broom gets model results into a tidy format at different levels. (And broom.mixed does this for mixed models.)
- broom::glance (values such as AIC, BIC, model fit, …)
- broom::tidy (estimate, confidence interval, significance level, …)
- broom::augment (fitted values, residuals, predictions, influence, …)

# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.220 0.218 1.00 93.8 5.52e-104 6
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
# deviance <dbl>, df.residual <int>, nobs <int>
# A tibble: 7 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.0125 0.0224 -0.555 5.79e- 1
2 x1 1.03 0.126 8.17 5.59e-16
3 x2 0.460 0.268 1.72 8.58e- 2
4 x3 0.622 0.232 2.69 7.29e- 3
5 x4 0.965 0.820 1.18 2.39e- 1
6 x5 0.968 0.192 5.05 4.81e- 7
7 x6 0.930 0.117 7.94 3.40e-15
tidymodels is a set of R packages that make it easier to build, tune, and evaluate statistical and machine learning models, all using the same consistent, tidyverse-style syntax. It provides a framework for the modeling workflow in R:
- data preprocessing and feature engineering (recipes)
- model specification (parsnip)
- data splitting and resampling (rsample)
- hyperparameter tuning (tune)
- evaluation, workflows, and tidy output (yardstick, workflows, broom)

Source: A predictive modeling case study
About the data: the hotel bookings data from Antonio, Almeida, and Nunes (2019), used to predict which hotel stays included children and/or babies, based on other characteristics of the stay such as which hotel the guests stayed at, how much they paid, etc.
Model: We will build a model to predict which actual hotel stays included children and/or babies, and which did not. Our outcome variable children is a factor variable with two levels.
Rows: 50,000
Columns: 23
$ hotel <fct> City_Hotel, City_Ho…
$ lead_time <dbl> 217, 2, 95, 143, 13…
$ stays_in_weekend_nights <dbl> 1, 0, 2, 2, 1, 2, 0…
$ stays_in_week_nights <dbl> 3, 1, 5, 6, 4, 2, 2…
$ adults <dbl> 2, 2, 2, 2, 2, 2, 2…
$ children <fct> none, none, none, n…
$ meal <fct> BB, BB, BB, HB, HB,…
$ country <fct> DEU, PRT, GBR, ROU,…
$ market_segment <fct> Offline_TA/TO, Dire…
$ distribution_channel <fct> TA/TO, Direct, TA/T…
$ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0…
$ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0…
$ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0…
$ reserved_room_type <fct> A, D, A, A, F, A, C…
$ assigned_room_type <fct> A, K, A, A, F, A, C…
$ booking_changes <dbl> 0, 0, 2, 0, 0, 0, 0…
$ deposit_type <fct> No_Deposit, No_Depo…
$ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0…
$ customer_type <fct> Transient-Party, Tr…
$ average_daily_rate <dbl> 81, 170, 8, 81, 158…
$ required_car_parking_spaces <fct> none, none, none, n…
$ total_of_special_requests <dbl> 1, 3, 2, 1, 4, 1, 1…
$ arrival_date <date> 2016-09-01, 2017-0…
Data splitting and re-sampling
Reserve 25% of the stays for the test set. The outcome variable children is quite imbalanced, so use a stratified random sample.
# A tibble: 2 × 3
children n prop
<fct> <int> <dbl>
1 children 3046 0.0812
2 none 34454 0.919
# A tibble: 2 × 3
children n prop
<fct> <int> <dbl>
1 children 992 0.0794
2 none 11508 0.921
The “other” set is further divided into training and validation sets. These two are used to build the model.
The test set is only used when the final model is decided.
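A base-R sketch of a stratified split (the 25% test fraction and class counts mirror the hotel example above; the function and variable names are my own, and this is not how rsample implements it):

```r
# Stratified split: sample 25% within each class for the test set,
# so the class proportions are (approximately) preserved
stratified_split <- function(class, test_frac = 0.25) {
  test_idx <- unlist(lapply(split(seq_along(class), class), function(idx)
    sample(idx, size = round(test_frac * length(idx)))))
  list(test = test_idx, other = setdiff(seq_along(class), test_idx))
}

set.seed(123)
children <- factor(rep(c("children", "none"), c(4038, 45962)))
s <- stratified_split(children)

prop.table(table(children[s$test]))   # close to the full-data proportions
prop.table(table(children[s$other]))
```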
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.